Using projected timestamps to control the sequencing of file modifications in distributed filesystems

ABSTRACT

It is not possible to maintain extremely tight synchronization of the time keeping clocks of the networked nodes comprising a distributed filesystem. However, when multiple client systems access the same file from different remote locations, a distributed consistency mechanism must ensure that all file read and write requests are only serviced from the latest version of the file. The current industry practice is to disable client-side caching when a concurrent write sharing condition arises (multiple clients active on the file and at least one of them writing). This forces all requests to flow through to the file server and consistency is maintained since all requests are then serviced from the same file image. The current practice sacrifices performance and scalability to maintain consistency. This document discloses methods for projecting and maintaining temporary filesystem timestamps that allow file read and write requests to be serviced from remote cached file images while still providing the same file consistency as the current industry practice. The temporary filesystem timestamps are updated to real filesystem timestamps whenever the client-side cache communicates with the file server.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of and priority to U.S. Provisional Patent Application Ser. No. 61/666,597 filed on Jun. 29, 2012, which application is incorporated herein by reference in its entirety.

This application is related to co-pending U.S. application Ser. No. ______, filed on Jun. 28, 2013, and entitled, “RECURSIVE ASCENT NETWORK LINK FAILURE NOTIFICATIONS” (Attorney Docket No. 10284.14), which application is incorporated herein by reference in its entirety.

This application is related to co-pending U.S. application Ser. No. ______, filed on Jun. 28, 2013, and entitled, “DISTRIBUTED FILESYSTEM ATOMIC FLUSH TRANSACTIONS” (Attorney Docket No. 10284.15), which application is incorporated herein by reference in its entirety.

This application is related to co-pending U.S. application Ser. No. ______, filed on Jun. 28, 2013, and entitled, “METHOD OF CREATING PATH SIGNATURES TO FACILITATE THE RECOVERY FROM NETWORK LINK FAILURES” (Attorney Docket No. 10284.17), which application is incorporated herein by reference in its entirety.

BACKGROUND OF THE INVENTION

The Distributed Data Service (DDS) architecture provides a framework for highly distributed, hierarchical, multi-protocol caching. DDS is a distributed caching layer that spans an enterprise's network, encompassing multiple LANs interconnected with WAN links. The caching layer's constituent parts are DDS modules installed on file servers, client workstations, and intermediate nodes (routers, switches, and computers). DDS employs TCP/IP for inter-site communications and may therefore be incrementally deployed. Non-DDS nodes appear as “just part of the wire”.

Conceptually, the DDS caching layer slices through each DDS configured computer system at the vnode interface layer. File systems (UFS, VxFS, NTFS, EXT4) and other devices such as video sources and shared memory plug into the bottom of the caching layer and provide permanent file storage or a data sourcing/sinking capability. Client systems plug into the top of the caching layer to access “local” data. Distributed throughout the network, intermediate DDS nodes (routers, switches, and other computers) provide increased scalability and faster file access.

The DDS layer implements an intelligent integrated data streaming and caching mechanism to make file data appear as “local” as possible. When a client process accesses a file, the file appears to be local (in terms of file access performance) if it has been accessed before and has not been modified since its last access. When file data must be fetched from the origin server, DDS pre-fetches file data in advance of the client's request stream. Of course, pre-fetching is only performed for well-behaved clients. Write behind is also implemented by DDS in a manner consistent with the fact that users aren't very tolerant of filesystems that lose their data.

Data cached within the DDS layer is stored in a protocol neutral format in a manner that requires no “translation” when the client is of the same type (Unix, Windows, . . . ) as the origin server.

The DDS layer maintains “UFS consistency” (a read always returns the most recently written data) on cached images and provides several methods of handling and recovering from network partitioning events. To the maximum extent possible, recovery and reconnection are performed automatically with no requirement for user or administrator intervention.

This document discloses and explains the methods and procedures employed by DDS to transparently overcome network infrastructure failures. In this context, “transparent” means that when an intermediate network node or link fails during a DDS file access operation, an alternate path to the origin server is discovered and used to complete the operation without the client or server ever even becoming aware of the network failure.

BRIEF SUMMARY OF SOME EXAMPLE EMBODIMENTS

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential characteristics of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

One example embodiment includes a computing system where a data request has been passed between a file service proxy cache node and a downstream site, the file service proxy cache node being a network node located between a client system and the origin file system node, a non-transitory computer-readable storage medium including instructions that, when executed by the file service proxy cache node, perform the step of dispatching a file access request to the downstream site. The instructions also perform the step of receiving a response to the file access request. The response includes a version number of a file image cached at the downstream site. The instructions further perform the step of comparing the version number of the file image cached at the downstream site to a version number of a file image cached at the file service proxy cache node. The instructions additionally perform the step of, if the version numbers are the same, continuing to use the file image cached at the file service proxy cache node. The instructions moreover perform the step of, if the version numbers are different: setting the version number of the file image cached at the file service proxy cache node to the version number of the file image cached at the downstream site; setting a version number arrival time to the file service proxy cache node's current time; and resetting a delta time to zero. The instructions also perform the step of, if the response to the file access request is not a flush response, discarding the current cached file image.
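By way of illustration only, the version comparison described above may be sketched in Python as follows. The names CachedImage and apply_response, the field names, and the use of a numeric clock are assumptions made for this sketch and are not part of the disclosure. The sketch further assumes that the discard step applies only when the version numbers differ.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedImage:
    """Hypothetical record of a file image cached at a proxy cache node."""
    version: int              # version number of the cached file image
    version_arrival: float    # local time at which this version was first seen
    delta_time: float         # reset to zero whenever a new version arrives
    data: Optional[bytes]     # cached file data; None once discarded


def apply_response(image, response_version, is_flush_response, now=None):
    """Update the local image from the version number carried in a response."""
    now = time.time() if now is None else now
    if response_version == image.version:
        return image                       # same version: keep using the cache
    image.version = response_version       # adopt the downstream version number
    image.version_arrival = now            # stamp with this node's current time
    image.delta_time = 0.0
    if not is_flush_response:
        image.data = None                  # stale image: discard cached data
    return image
```

In this sketch a flush response leaves the cached data in place, on the assumption that the data just flushed by this node is what produced the new version.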

Another example embodiment includes a computing system where a flush request has been received at a file service proxy cache node from an upstream file service proxy cache node, the file service proxy cache node being a network node located between a client system and the origin file system node, a non-transitory computer-readable storage medium including instructions that, when executed by the file service proxy cache node, perform the step of receiving a flush request from an upstream file service proxy cache node. The instructions also perform the step of comparing the version number of the file image cached at the upstream site to the version number of the file image cached at the file service proxy cache node. The instructions further perform the step of, if the version numbers do not differ, storing the received flush data in shadow extents. The instructions additionally perform the step of determining if the file service proxy cache node is a server terminator site. The instructions moreover perform the step of, if the file service proxy cache node is a server terminator site and if all flush batches have been received and at least one of the flush batches has a data complete request flag set, determining whether synchronous write mode is being used for the file identified in the request. The instructions also perform the step of, if synchronous write mode is being used, writing all file modifications to the underlying origin filesystem and determining whether the write to the origin filesystem is successful. The instructions further perform the step of, if the write to the origin filesystem is successful: setting an all data received response flag and replacing all extents that have shadow extents with their respective shadow extents; fetching the file's last modification timestamp from the origin filesystem; and responding to the received flush request with a status code that indicates the successful completion of the request. The instructions additionally perform the step of, if all flush batches have not been received, responding to the received flush request with a status code that indicates the successful completion of the request.

These and other objects and features of the present invention will become more fully apparent from the following description and appended claims, or may be learned by the practice of the invention as set forth hereinafter.

BRIEF DESCRIPTION OF THE DRAWINGS

To further clarify various aspects of some example embodiments of the present invention, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. It is appreciated that these drawings depict only illustrated embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:

FIG. 1 illustrates a DDS virtual file server constructed with a flat network topology;

FIG. 2 depicts the flat network of FIG. 1 reconfigured to provide high availability access to the eng and sales sub-domains, and another node has been added to create redundant network paths;

FIG. 3 depicts the /etc/dds_exports file (also referred to as the site map file) for node 4, which exports the eng sub-domain;

FIG. 4 illustrates the domain map file for node 1;

FIG. 5 illustrates the export filesystem (site tree) constructed by acme-4 by following the directions contained in the site map file;

FIG. 6 illustrates the domain tree for node 1;

FIG. 7 depicts a simple hierarchy of DDS nodes configured with multiple routes from node A to the origin server node (node S);

FIG. 8 is a flowchart illustrating an example of a method 800 of recursive ascent failure notifications;

FIG. 9 is a flowchart illustrating an example of a method of atomic flush transactions;

FIG. 10 is a flowchart illustrating an example of a method of validating a cached file image when a response received at an upstream site contains the same version number as the one associated with the cached file image;

FIG. 11 is a flowchart illustrating an example of a method of flushing modified file data downstream towards the DDS server terminator site;

FIG. 12 is a flowchart illustrating an example of a method employed by a DDS node to load a path signature when a connection is established; and

FIG. 13 is a flowchart illustrating an example of a method employed to reconnect an upstream DDS node.

DETAILED DESCRIPTION OF SOME EXAMPLE EMBODIMENTS

Reference will now be made to the figures wherein like structures will be provided with like reference designations. It is understood that the figures are diagrammatic and schematic representations of some embodiments of the invention, and are not limiting of the present invention.

DDS Domain Architecture

A DDS network may be configured into either a flat or hierarchical organization. Hierarchical topologies inherently provide more latitude for constructing networks incorporating redundant paths. When network failures do occur, DDS employs redundant network paths to transparently re-route file access traffic around the failure.

FIG. 1 illustrates a DDS virtual file server constructed with a flat network topology. The virtual file server exports the acme domain, consisting of four sub-domains: corp, mrkt, sales and eng. The virtual file server appears to be a single multi-homed (multiple IP addresses) Windows or Linux file server to client workstations. The public network interfaces are DNS registered with the names acme-1, acme-2, acme-3 and acme-4. Workstations may use any of the four public interfaces and may switch to using a different interface at any time.

When all components of the virtual file server are operating properly, the “view through any portal” is equivalent to the view through any other portal. DDS's distributed consistency mechanism works behind the scenes to ensure this consistency of views. However, when a network node fails, a flat network topology does not provide the redundant paths necessary to provide transparent uninterrupted access to all file server data.

FIG. 2 depicts the flat network of FIG. 1 reconfigured to provide high availability access to the eng and sales sub-domains, and another node has been added to create redundant network paths.

Nodes 3 and 4 have shared access to a storage area network (SAN) containing the filesystems exported through the eng and sales sub-domains. These two nodes are configured such that if either node fails, the other node will notice the failure and mount and export the failed node's filesystems. This is a standard high availability file server mechanism and products are available from several sources, both supported and freeware.

Another node has been added to the configuration depicted in FIG. 1. Node 5, named acme-5, does not export any of its own filesystems. It is a domain manager providing the same “portal view” as the other four nodes. However, it also serves as the DDS gateway for remote clients and remote DDS workgroup accelerators.

DDS nodes operate independently and cooperatively to create a hierarchical global namespace. During its initialization phase a DDS node constructs a site tree as specified by the /etc/dds_exports file. FIG. 3 depicts the /etc/dds_exports file (also referred to as the site map file) for node 4, which exports the eng sub-domain.

After constructing the site tree, the DDS node constructs a domain tree as specified by the /._dds_./._site_./._control_./._map_. file. FIG. 4, illustrating the domain map file for node 1, specifies:

-   the domain's name is acme,
-   the domain has four domain managers: acme-1, acme-2, acme-3 and acme-4,
-   the domain has four sub-domains:
    -   n1 exporting the corp sub-domain,
    -   n2 exporting the mrkt sub-domain,
    -   n3 exporting the sales sub-domain, and
    -   n4 exporting the eng sub-domain.

The names used in a domain map file may be public names (registered with a network name service such as DNS) or they may be private. In this example, n1 through n4 are private names and acme-1 through acme-4 are public names. The domain nodes communicate with each other using the domain's private names, and clients of the DDS virtual file server use the domain's public names. A node's private name and public name may resolve to the same IP address or to different IP addresses.

Once initialization is complete, the node exports these two trees. The site tree is a single filesystem containing all content exported by this node. The domain tree is a single filesystem containing all content exported by all sub-domain nodes for which this node is a domain manager.

A DDS node may host:

-   An atomic domain: The node has a /etc/dds_exports file but there is no /._dds_./._site_./._control_./._map_. file. An atomic domain does not contain any sub-domains. The node's site tree is also its domain tree (by way of a symbolic link).
-   A non-atomic domain: The node has a /._dds_./._site_./._control_./._map_. file which specifies the domain's name and the names of all sub-domains. The node might not have a /etc/dds_exports file, in which case the node does not contribute any content to the domain for which it is a domain manager. Or, the node might have a site tree (specified by the /etc/dds_exports file) that may or may not be included in its domain tree.
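As a non-limiting illustration, the two hosting cases can be distinguished by checking which map files are present. In the Python sketch below, classify_node and its root parameter are hypothetical names introduced for the example. The "not configured" result for a node with neither file is an assumption, since that case is not described above.

```python
import os

# Paths taken from the description, expressed relative to a root directory
# so that the sketch can be exercised against a scratch directory.
SITE_MAP = "etc/dds_exports"
DOMAIN_MAP = "._dds_./._site_./._control_./._map_."


def classify_node(root="/"):
    """Return the kind of domain a node hosts, based on its map files."""
    has_site_map = os.path.isfile(os.path.join(root, SITE_MAP))
    has_domain_map = os.path.isfile(os.path.join(root, DOMAIN_MAP))
    if has_domain_map:
        # A domain map makes the node a domain manager, with or without
        # a site tree of its own.
        return "non-atomic domain"
    if has_site_map:
        # Site map only: the site tree doubles as the domain tree.
        return "atomic domain"
    return "not configured"   # assumption: case not covered by the text
```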

DDS Initialization

The initialization process for a DDS node occurs in two phases:

-   Phase 1: Site initialization. (Refer to FIG. 3 and FIG. 5) The site map file (/etc/dds_exports) is read to determine the exported filesystems and the policy attributes associated with each exported filesystem. On a per filesystem basis, the policy attributes provide the default policies associated with every file within the filesystem. Default policies can be overridden by directory level and/or file level policy attributes.

    FIG. 3, depicting the site map file for node 4, specifies that the node will export a filesystem with the branches /export/eng/hw and /export/eng/sw, both with the same policy attributes of "*(rw,sync,wdelay,root_squash)". FIG. 5 illustrates the export filesystem (site tree) constructed by acme-4 by following the directions contained in the site map file. FIG. 5 also shows portions of acme-4's filesystem not being exported; they are interconnected with dashed lines instead of solid lines.

-   Phase 2: Domain initialization. (Refer to FIGS. 4 and 6) After DDS site initialization completes, DDS checks for the presence of a domain map file (/._dds_./._site_./._control_./._map_.). If present, this file informs the DDS instance that it is a domain manager and it also specifies the name of the domain, the names of all sub-domains and the names of all sub-domain managers.

    FIG. 4 is the domain map file for the acme domain depicted in FIG. 1 and FIG. 6 illustrates the acme domain constructed by each node configured with that domain map file.

    Having discovered their domain map files, each DDS node constructs the acme domain tree by requesting from each sub-domain node the root of that node's domain tree. The returned roots are then grafted onto the root of the host node's domain tree.

    Referring to FIG. 6, the acme domain contains the sub-domains corp, mrkt, sales and eng. The eng sub-domain shows additional detail (the site tree exported by node 4), but the respective site trees of the other nodes are not depicted. The acme domain has four portals (acme-1, acme-2, acme-3, acme-4 hosted respectively on nodes 1 through 4), and each portal has a path to every sub-domain.

Note that the site map file specifies what is exported, but not what it is called. The domain map file (FIG. 4) specifies that n4 (private name for node 4) supplies the eng sub-domain, but it is the site map file (FIG. 3) that defines what node 4 will export under the name eng. So, the pathname employed by a user on a client workstation to access an acme hardware engineering document would look something like: /dds/acme/eng/hw/the_document.doc.
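The division of labor between the two map files can be sketched as follows, purely as an illustration. Site trees are modeled as nested Python dictionaries, and the function names build_domain_tree and resolve are hypothetical. The contents of the corp, mrkt and sales site trees are not described in this document and are left empty here.

```python
# Domain map: sub-domain name -> private name of the exporting node.
SUB_DOMAINS = {"corp": "n1", "mrkt": "n2", "sales": "n3", "eng": "n4"}

# Site trees: what each node exports. Only node 4's tree is described;
# the others are empty placeholders for this sketch.
SITE_TREES = {
    "n1": {},
    "n2": {},
    "n3": {},
    "n4": {"hw": {"the_document.doc": "<file>"}, "sw": {}},
}


def build_domain_tree(domain_name, sub_domains, fetch_root):
    """Graft the root returned by each sub-domain node under its name."""
    grafted = {name: fetch_root(node) for name, node in sub_domains.items()}
    return {domain_name: grafted}


def resolve(tree, path, mount="/dds"):
    """Walk a client pathname through the domain tree."""
    if not path.startswith(mount + "/"):
        raise ValueError("path is outside the DDS namespace")
    node = tree
    for part in path[len(mount):].strip("/").split("/"):
        node = node[part]
    return node


acme = build_domain_tree("acme", SUB_DOMAINS, SITE_TREES.__getitem__)
document = resolve(acme, "/dds/acme/eng/hw/the_document.doc")
```

The name eng comes only from the domain map, while the hw branch beneath it comes only from node 4's site tree, which mirrors the separation described above.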

After each DDS node completes constructing its domain tree it is open for business. The multi-homed virtual file server acme may now be accessed through any of its four network interfaces. The top level directory structure of acme's exported domain tree is:

/dds/acme/corp/...
/dds/acme/mrkt/...
/dds/acme/sales/...
/dds/acme/eng/...

Clients may now direct their requests to any acme portal and expect to receive the same response.

During Phase 2, domain initialization, a DDS node constructs a global namespace that includes its exported filesystem and the exported filesystems of all of its sub-domains. Following initialization, the DDS node is a file access portal to all files and directories contained within the domain's global namespace. The DDS node may also be bound into a larger domain as a sub-domain of a ‘higher level’ domain. This process may be recursively repeated until there is a single Internet domain that encompasses content from thousands or millions of origin servers distributed throughout the world.

Using the process described above to construct the multi-homed acme domain, thousands of DDS nodes may initialize to become Internet domain managers (portals). So, a portal located anywhere in the world may provide access to content distributed about the globe.

DDS Global File Services

DDS employs extensive file level hierarchical caching to make data appear to be ‘here’ rather than ‘there’. From a filesystem perspective, a phrase that encapsulates DDS's primary focus is: DDS removes the distinction between local files and remote files.

The distinctions removed are:

-   Latency and Bandwidth: an image of the file is cached locally and therefore can be accessed at “local” speeds.
-   Consistency: read operations always return the most recently written data.
-   Security: file data flowing and cached within DDS networks is encrypted and the content owner maintains complete control over its content throughout the distribution network. All DDS portals faithfully follow the content owner's instructions (which are attached to the content as policy attributes) with regard to providing access to unencrypted content.
-   Availability: DDS may be used to construct resilient networks and file servers. Redundancy can be woven into the DDS fabric to create always-available networks, and redundancy incorporated into file servers can ensure the continuous availability of file data. DDS transparently overcomes network failures whenever redundant paths make it possible to do so.
-   Protocol: DDS appears to be just another local filesystem, using the same filesystem API as native local filesystems. DDS extends the native filesystem API to provide a remote file access capability that is almost indistinguishable from the access capabilities afforded to local files.

DDS Terminology

The following words and phrases, used throughout this document, have these definitions:

-   channel, integrated channel: In the context of a single DDS node, a channel is DDS's main data structure for representing a remote file and all information related to that file. The channel data structure contains a number of smaller data structures either directly or indirectly (by containing a reference to the smaller data structure). The file attributes data structure and the file data extent structures are referenced by the channel using memory address pointers.

    In the context of a client terminator site communicating across the network with a server terminator site (possibly through several intermediate sites), channel refers to the channels at the individual sites bound together by DDS Protocol into a single integrated channel.

    NOTE: When an application, executing on a DDS configured origin server, is accessing a file within the origin server the channel for that file will simultaneously fulfill the roles of both client terminator site and server terminator site. A channel does not always span multiple DDS sites.

-   file data extent: a contiguous memory segment that holds file data. The size of the file data extent is set following a negotiation with a downstream site when the channel is created. The file data extent structure contains a shadow pointer, which is the memory address of a contiguous memory segment (the shadow extent) of the same size as the file data extent.

-   policy attributes: file attributes attached to a file by a domain manager as the file's data is sent upstream in the response to a file access request. At upstream sites, these attributes, associated with the file in the same manner as the file's “normal” attributes, instruct the upstream site on the procedures required for granting access, performing decryption and all other file handling operations. (Only the upstream sites that have been authenticated by a downstream site will be trusted by the downstream site.)

-   external request: a file access request using a file access protocol other than DDS. NFS, CIFS, UFS, EXT4 and NTFS are examples of external requests. Note that NFS and CIFS are network protocols and the others are “local” protocols (used when DDS is installed on the system generating the request).

-   internal request: a file access request using the DDS protocol. Internal requests are internal to DDS and flow exclusively between DDS sites.

-   client terminator site or client terminator: The DDS site that receives an external request from a client.

-   server terminator site or server terminator: The DDS site “closest” to the origin file server. When the origin server is DDS configured the server terminator site is the origin server. In other cases, the server terminator communicates with the origin server using a network protocol such as NFS or CIFS.

-   intermediate site: A DDS site in the chain linking the client terminator site to the server terminator site.

-   client system, client computer or just client: In the context of DDS processing a request, the client is the computer system that dispatched the external request to the DDS client terminator site.

-   upstream site: When two DDS sites are communicating, the site “closest” to the client is the upstream site.

-   downstream site: When two DDS sites are communicating, the site “closest” to the origin server is the downstream site.

-   client-side: With respect to any point along the integrated channel path from server terminator to client terminator, client-side refers to everything on the client side of that point.

-   server-side: With respect to any point along the integrated channel path from server terminator to client terminator, server-side refers to everything on the server side of that point.

-   origin file server or origin server: The file server exporting the filesystem containing the target file.

-   DDS site or DDS node: a DDS configured network node that provides a file proxy cache service.
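For illustration only, the relationship among a channel, its file attributes, and its file data extents (each with a shadow extent of the same size) might be modeled as shown below. The class and field names are assumptions made for this sketch, and Python object references stand in for the memory address pointers mentioned in the definitions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Extent:
    """A contiguous segment of file data plus an optional shadow extent."""
    data: bytearray
    shadow: Optional[bytearray] = None   # stands in for the shadow pointer

    def attach_shadow(self):
        # The shadow extent is always the same size as the data extent.
        self.shadow = bytearray(len(self.data))
        return self.shadow


@dataclass
class Channel:
    """Per-file structure holding attributes and cached extents."""
    file_id: str
    extent_size: int                      # negotiated with the downstream site
    attributes: dict = field(default_factory=dict)
    extents: Dict[int, Extent] = field(default_factory=dict)

    def extent_for(self, offset):
        """Return (creating if needed) the extent covering a byte offset."""
        index = offset // self.extent_size
        if index not in self.extents:
            self.extents[index] = Extent(bytearray(self.extent_size))
        return self.extents[index]
```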

DDS Operations

A single DDS module contains client-side code for requesting file data from an origin file server and server-side code that receives and responds to requests from “upstream” DDS sites. Within the DDS framework, “downstream” is towards the origin server and “upstream” is towards the client.

An overview of a typical DDS network operation is:

-   a client computer system issues an NFS file access request targeting a DDS portal,
-   the request is received at the portal (a DDS configured Linux system) and routed to an nfs daemon (the native NFS server code),
-   the NFS server code executes a read system call to read file data,
-   the Linux vfs layer routes the system call into the DDS module (which has registered as a local filesystem),
-   the call's file identifier parameter is used to identify and connect to a channel (DDS's main data structure for representing a file and all information related to that file),
-   the channel is examined to determine if all data required to respond to the system call is cached within the channel; if so, DDS responds to the system call; if not, . . .
-   the channel is examined to determine the origin file server's identity,
-   if the origin server is this node, DDS executes a read system call to fetch whatever additional file data is required from the underlying native filesystem to respond to the request from the NFS server code,
-   if the origin server is some other node, DDS generates and dispatches a DDS_LOAD request targeting a downstream DDS site “closer” to the origin server,
-   the DDS_LOAD request may ripple through multiple DDS intermediate sites (executing essentially the same procedure as outlined above) before arriving at the DDS server terminator site,
-   DDS executes a read system call to fetch whatever additional file data is required from the underlying native filesystem to respond to the request from an upstream DDS site,
-   the response propagates back upstream and eventually arrives at the DDS client terminator site,
-   the file data contained in the response is attached to the channel structure,
-   all data required to respond to the NFS system call is now cached within the channel, so DDS responds to the call from the NFS server code,
-   the NFS server code responds to the request from the client computer system.
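The read path outlined above can be sketched, under simplifying assumptions, as a chain of sites that each consult a local channel cache before forwarding a load request downstream. The Site class and its members are hypothetical. Files are cached whole rather than by extent, a dictionary stands in for the native filesystem, and consistency handling is omitted.

```python
class Site:
    """A DDS site modeled as a cache in front of a downstream site."""

    def __init__(self, name, downstream=None, local_files=None):
        self.name = name
        self.downstream = downstream          # next site toward the origin
        self.local_files = local_files or {}  # stand-in for native filesystem
        self.channels = {}                    # file_id -> cached file data
        self.loads = 0                        # load requests sent downstream

    def read(self, file_id):
        if file_id in self.channels:
            return self.channels[file_id]     # data already cached in channel
        if self.downstream is None:
            # Server terminator: fetch from the underlying native filesystem.
            data = self.local_files[file_id]
        else:
            # Otherwise dispatch a load request toward the origin server.
            self.loads += 1
            data = self.downstream.read(file_id)
        self.channels[file_id] = data         # attach the data to the channel
        return data


origin = Site("S", local_files={"report": b"hello"})
middle = Site("M", downstream=origin)
client_terminator = Site("C", downstream=middle)
first = client_terminator.read("report")
second = client_terminator.read("report")    # served from the local channel
```

After the first read, each site on the path holds a cached image, so the second read completes without any further downstream traffic.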

DDS Protocol

The DDS protocol defines two remote procedures for transporting file data: DDS_LOAD and DDS_FLUSH. These two procedures are briefly described since recovery operations are based upon variations of these procedures.

DDS_LOAD: This operation loads data and metadata from a downstream site. The request includes a file identifier and the flags (DDS_CC_SITE_READING, DDS_CC_SITE_WRITING) that inform the downstream site of the types of operations that will be performed upon the file data being loaded. These flags are used by the distributed consistency mechanism to keep track of the type of operations (read vs. write) being performed at upstream sites.

A single load or flush request may specify multiple file segments and each segment may be up to 4 gigabytes in length.

The response includes flags (DDS_CC_SUSTAIN_DIR_PROJECTION and DDS_CC_SUSTAIN_FILE_PROJECTION) that indicate whether the returned file data and metadata may be cached or whether it must be discarded immediately after responding to the current client request.
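As a hedged illustration of how these flags might be represented and consulted, the Python sketch below defines the four flag names as bit flags. The numeric values, the class names, and the helper functions are assumptions for this example only. The concurrent_write_sharing helper applies the definition given in the Abstract: multiple sites active on the file, at least one of them writing.

```python
from enum import IntFlag


class LoadFlags(IntFlag):
    """Request flags; the bit values are placeholders, not protocol values."""
    DDS_CC_SITE_READING = 0x1
    DDS_CC_SITE_WRITING = 0x2


class ResponseFlags(IntFlag):
    """Response flags; the bit values are placeholders as well."""
    DDS_CC_SUSTAIN_DIR_PROJECTION = 0x1
    DDS_CC_SUSTAIN_FILE_PROJECTION = 0x2


def concurrent_write_sharing(active_sites):
    """active_sites maps an upstream site name to the LoadFlags it sent."""
    writers = [
        site for site, flags in active_sites.items()
        if flags & LoadFlags.DDS_CC_SITE_WRITING
    ]
    return len(active_sites) > 1 and len(writers) >= 1


def may_cache_file(response_flags):
    """True when the response permits the file projection to be sustained."""
    return bool(response_flags & ResponseFlags.DDS_CC_SUSTAIN_FILE_PROJECTION)
```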

DDS_FLUSH: This operation flushes modified file data/metadata to some form of stable memory. A flush level specifies how far the flush propagates. The currently defined levels are:

-   DDS_FLUSH_TO_STABLE_MEMORY: Flush to client terminator's flash memory
-   DDS_FLUSH_TO_DISK: Flush to client terminator's disk
-   DDS_FLUSH_TO_ORIGIN: Flush all the way to the origin server

In response to DDS_LOAD requests, the DDS server terminator site projects file data into remote DDS client terminator sites. These projections are sustained in the remote DDS sites while the file is being accessed at those sites unless a concurrent write sharing condition arises.

An upstream DDS cache buffer is no different than an internal origin file server buffer. After a write operation modifies a file system buffer (either local or remote), performance is enhanced if the buffer is asynchronously written to the server's disk. However, file modifications are safeguarded when they are synchronously written to disk or some other form of stable storage. Flush levels allow both the client and the server to express their level of paranoia. The more paranoid of the two usually prevails.
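One simple way to model "the more paranoid of the two prevails" is to order the flush levels by how far the flush propagates and take the maximum. The sketch below does so. The numeric ordering and the function name effective_flush_level are assumptions of this example, and the text says only that the stricter level usually prevails, so a real policy may include exceptions not modeled here.

```python
from enum import IntEnum


class FlushLevel(IntEnum):
    """Flush levels ordered by how far the flush propagates (assumed order)."""
    DDS_FLUSH_TO_STABLE_MEMORY = 1
    DDS_FLUSH_TO_DISK = 2
    DDS_FLUSH_TO_ORIGIN = 3


def effective_flush_level(client_level, server_level):
    """Pick the stricter of the levels requested by client and server."""
    return max(client_level, server_level)
```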

As disclosed in this document, an upstream DDS site flushes all of a channel's dirty data and metadata downstream as an atomic unit. When the amount of dirty data is more than can be accommodated in a single network operation, the downstream site remains “committed” to the upstream site until the last batch of data (flagged with DDS_FLUSH_DATA_COMPLETE) is successfully received. This means that the channel at the downstream site will not service a request from any other upstream site until the flush has completed.

This works fine as long as everything else works fine. But when two sites get partitioned in the midst of a multi-transfer flush operation, the client-side and the server-side will both attempt to overcome the failure. At some point, however, the server-side may decide (based on its current policies) to cut off the isolated upstream site and continue providing file access services to its other client systems. In this case, the flush operation is less than atomic. This is unacceptable because file modifications must be atomic at all times and under all circumstances.

Within DDS each file extent structure contains a pointer to a shadow extent, and each attribute structure contains a pointer to a set of shadow attributes. When a multi-transfer flush is processed at a downstream node, all incoming data is routed into these shadow structures. Then, when all dirty data (extents and attributes) has been received at the downstream site, the shadow structures are promoted to reality in an atomic operation and the ‘old’ structures are released. Of course, when the multi-transfer flush does not complete successfully, all shadow extents and the shadow attributes must be discarded.
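The shadow-structure idea can be illustrated with a small sketch. The class and method names here (Extent, Channel, receive_batch, promote, discard) are hypothetical, attributes are omitted, and a real implementation would need locking so that promotion is atomic with respect to concurrent readers.

```python
class Extent:
    def __init__(self, data):
        self.data = data      # the "real" extent contents
        self.shadow = None    # shadow contents while a flush is in progress


class Channel:
    def __init__(self, extents):
        self.extents = extents  # dict: file offset -> Extent

    def receive_batch(self, batch):
        """Route incoming flush data into shadow extents only."""
        for offset, data in batch.items():
            ext = self.extents.setdefault(offset, Extent(None))
            ext.shadow = data

    def promote(self):
        """All data received: shadows become reality, old data is released."""
        for ext in self.extents.values():
            if ext.shadow is not None:
                ext.data, ext.shadow = ext.shadow, None

    def discard(self):
        """Flush failed: drop every shadow and leave real extents untouched."""
        for offset in list(self.extents):
            ext = self.extents[offset]
            ext.shadow = None
            if ext.data is None:  # extent existed only as a shadow
                del self.extents[offset]
```

Until promote() runs, readers of the channel still see the pre-flush data, which is what lets an interrupted multi-transfer flush be abandoned without side effects.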

At an upstream node, when DDS flushes a channel, each dirty extent (flagged with X_DIRTY) is flagged with X_FLUSHING. If the flush operation does not complete successfully, DDS resets all X_FLUSHING flags. And, of course, when the operation is successful both the X_DIRTY and X_FLUSHING flags are reset.
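A minimal sketch of this flag handling follows. The bit values and the function names begin_flush and end_flush are assumptions made for illustration; only the flag names X_DIRTY and X_FLUSHING come from the text.

```python
X_DIRTY = 0x1     # hypothetical bit value
X_FLUSHING = 0x2  # hypothetical bit value


def begin_flush(flags):
    """Mark every dirty extent as being flushed."""
    return [f | X_FLUSHING if f & X_DIRTY else f for f in flags]


def end_flush(flags, success):
    """On success clear both flags on flushed extents; on failure clear only X_FLUSHING."""
    if success:
        return [f & ~(X_DIRTY | X_FLUSHING) if f & X_FLUSHING else f for f in flags]
    return [f & ~X_FLUSHING for f in flags]
```

After a failed flush every extent is exactly as dirty as it was before the attempt, so the flush can simply be retried later.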

The DDS client-side node processes a multi-transfer flush operation in an atomic manner. Once the channel has been acquired and the first batch has been accepted at the downstream site (as opposed to rejected because of a consistency operation), it will not be released until the last batch has been dispatched. And when released, the channel will either be clean (successful flush) or it will be just as dirty as it ever was.

In addition to the two remote procedures used to move file data up and down the wire, DDS also defines:

-   DDS_CONNECT for establishing a connection to a filesystem (equivalent to an NFS mount operation) or a connection to a directory or a file,
-   DDS_NAME creates, modifies and deletes file/directory names and links,
-   DDS_CTRL provides various capabilities required to actively monitor the health of DDS domain nodes and to support the DDS consistency mechanism. The two DDS_CTRL procedures that support DDS's distributed consistency mechanism are:
    -   fast ping: dispatched frequently by a DDS client site to ensure that it is still in communication with a downstream site. The fast ping rate brackets the amount of time that a client site can operate before becoming aware that it is disconnected from its downstream counterpart. This rate is typically set to about one second, but could be much higher when DDS nodes are interconnected with extremely fast links and/or shared memory.
        A downstream site is fast pinged only when there has been no other communication with the site for the amount of time specified by the fast ping rate. Any successful message exchange with a downstream site serves the purpose of a fast ping.
    -   slow ping: issued by a DDS client site as a “self-addressed stamped envelope” (SASE) that the DDS server-side node uses when it wants to deliver a consistency notification message. The server-side node will not respond to this request until it has a notification it wants delivered to the DDS client node. Thus, the name “slow ping”.
        slow ping is the means by which DDS implements a callback mechanism for consistency control operations.
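The fast ping suppression rule (ping only after a quiet interval, with any successful exchange counting as a ping) might be tracked as in the following sketch. The class name FastPinger, its methods, and the use of caller-supplied clock readings in seconds are illustrative assumptions.

```python
class FastPinger:
    def __init__(self, interval_s=1.0):
        self.interval_s = interval_s  # "about one second" in the text
        self.last_contact = None      # time of last successful exchange

    def note_contact(self, now):
        """Any successful message exchange serves the purpose of a fast ping."""
        self.last_contact = now

    def ping_due(self, now):
        """A ping is due only when the link has been quiet for a full interval."""
        return self.last_contact is None or now - self.last_contact >= self.interval_s
```

A busy link therefore generates no ping traffic at all; pings appear only when the channel would otherwise be silent.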

I. DDS Network Operations

This section presents a simplified overview of DDS network communications.

DDS nodes communicate using multi-threaded SUNRPC remote procedure calls over TCP/IP connections. For every successful remote procedure call there is a client making the call and a server responding to the call. SUNRPC and TCP/IP have built-in mechanisms to reliably transport requests and responses across a network. DDS, layered on top of the SUNRPC and TCP/IP combination, depends upon this protocol stack for reliable message delivery.

Every DDS remote procedure call issued eventually returns with an indication of the status of the call.

-   RPC_SUCCESS indicates that both the request and the response were successfully transported across the wire.
-   RPC_TIMEDOUT indicates that a response was not received. This occurs when a network link or node (including a DDS node) has gone offline or failed. When alternate paths make it possible to still access the source file, the upstream DDS node (client-side) re-routes and re-issues the request to transparently overcome the network failure.
-   Other RPC_XXXX error codes should not occur. But, when they occur some administrative action is probably required.

High Level View of Network Failures

From a workstation or a DDS client node's perspective a component failure manifests itself as a failure to respond to a request. To the DDS client node it does not really matter whether a router, switch, intermediate DDS node or a server component failed. What does matter is that the client issued a request and did not receive a response. An “industry standard” NFS client would, in this circumstance, keep re-issuing the same request until the server responded and then the client would proceed as normal. (A DDS client is more proactive in this situation, and this is described later.)

There is a class of failures referred to as a network partition event, where both DDS client-side nodes and server-side nodes remain operational, but the failed component has isolated the client-side from the server-side. When a network partition event occurs, the client-side and server-side components assume very different roles.

The server-side's main priority is to ensure the integrity of all file data and then to continue providing file access services to clients still able to communicate with the server.

The client-side's priorities are: a) to safeguard file modifications that have not yet been successfully flushed to the origin server; b) to re-establish communication with the server and immediately flush all files ‘crossing’ the partition; and c) to continue providing file access services if possible.

So each side plays a different role during a partition event. However, each role is tempered and shaped by DDS domain policy attributes. These attributes provide instructions for handling file data, processing file access requests, and responding to failures. The following section, Centralized Control over Distributed Operations, describes how policy attributes are employed to exercise centralized control over geographically distributed DDS overlay networks.

Centralized Control over Distributed Operations

Whenever a DDS origin server responds to a file access request, the file's policy attributes are fed into the DDS distribution network as a class of metadata associated with the file data throughout the network. Every DDS node faithfully adheres to all policies specified by the file's policy attributes under all circumstances. (Of course, DDS nodes employ standard authentication methods to ensure that secured data is only sent to nodes that can be trusted.)

File metadata, including policy attributes, is provided with the same level of consistency as regular file data. Therefore, a metadata read operation (to fetch file attributes and/or policies) at any DDS site will return the most recently written metadata. This means the policies for handling a file or a group of files can be changed instantly throughout the network.

II. Network Failure Recovery

DDS Failure Recovery Building Blocks

DDS network operations incorporate the following features and characteristics designed to facilitate the transparent recovery from network component failures:

1. Network Transactions

A DDS node communicates with other DDS nodes at the network transaction level. A network transaction, usually consisting of a single request-response interaction, uses whatever number of remote procedure calls are required to complete an atomic DDS operation. For example, some DDS_FLUSH operations require multiple request-response interactions to perform a flush as an atomic DDS operation.

At the completion of any network transaction, the server-side DDS channel targeted by the request either “steps completely forward” to a new state or it remains unchanged from its original state.

At the completion of any network transaction, the client-side DDS channel issuing the request either “steps completely forward” to a new state or it remains unchanged from its original state.

2. Idempotent Operations

DDS servers incorporate a duplicate request cache (DRC) that enables the server to receive the same request multiple times and to ensure all responses will be the same as the first response.

Note that the network transaction feature ensures that each DDS node either “does” or “completely does not” respond to a client request. But it is possible, even likely, that the server can “do” while the client-side “completely does not” because a network failure prevented the delivery of the response. DDS's idempotent operations feature provides a graceful (and transparent!) method for the client-side to catch up with the server-side.
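A duplicate request cache can be sketched in a few lines. This is a simplified illustration, not the DDS implementation: keying on a caller-supplied request_id, the unbounded dictionary, and the names used are all assumptions.

```python
class DuplicateRequestCache:
    def __init__(self, handler):
        self.handler = handler  # function that actually executes a request
        self.responses = {}     # request_id -> first response produced

    def handle(self, request_id, request):
        """Execute a request once; replay the first response on any retry."""
        if request_id not in self.responses:
            self.responses[request_id] = self.handler(request)
        return self.responses[request_id]
```

When the client re-issues a request whose response was lost in transit, the server does not repeat the operation; it returns the stored response, which lets the client-side catch up with the state the server-side already reached.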

3. Recursive Ascent Failure Notifications

FIG. 7, depicting a simple hierarchy 700 of DDS nodes configured with multiple routes from node A to the origin server node (node S), assumes the following scenario for illustrative purposes:

-   node A has dispatched a DDS_LOAD request to node C, causing
-   node C to dispatch a DDS_LOAD request to node F, causing
-   node F to dispatch a DDS_LOAD request to node S, but
-   the link to node S has failed.

FIG. 8 is a flowchart illustrating an example of a method 800 of recursive ascent failure notifications. The method 800 can be used in a hierarchy 700 of FIG. 7. In the method 800, a failure notification (DDS_BAD_LINK) percolates up to a higher level only after the reconnect efforts at the current level failed to re-establish a downstream connection.

FIG. 8 shows that the method 800 can include declaring 802 a network failure. I.e., the DDS client node whose immediate downstream link has stopped responding is the site that declares 802 a network failure. For example, referring to FIG. 7, when the link from node F to node S fails while node C is attempting to load data from node S, node F will be the node that detects and declares 802 the network failure.

FIG. 8 also shows that the method 800 can include determining 804 if an alternate route is available. Determining 804 if an alternate route is available can include referencing network configuration data stored at the node that is declaring a network failure 802 or searching the directory tree for connections by other nodes to the target node. Additionally or alternatively, determining 804 if an alternate route is available can include communicating with other connected nodes to determine if a path to the target node exists. One of skill in the art will appreciate that such a request does not include a request to the originating node. I.e., node F of FIG. 7 will not search for an alternate path through node C, which sent the data request to node F.

FIG. 8 further shows that the method 800 can include reporting 806 an error code if an alternate route is not available. E.g., since node F has no alternate routes to node S, node F will respond to node C's DDS_LOAD request with an error code of DDS_BAD_LINK. Now, node C attempts to reconnect on an alternate path (node E in FIG. 7) and if it fails to do so, node C will respond to node A with a status of DDS_BAD_LINK. I.e., node C will complete the same method 800 being utilized by node F. Finally, node A will attempt to reconnect on an alternate path and if it fails to do so, it will respond to its client with a status of DDS_BAD_LINK.

FIG. 8 additionally shows that the method 800 can include re-establishing 808 the file connection over the alternative route if found. I.e., since node F has no alternate working route to node S, it responds to node C with DDS_BAD_LINK. Node C then attempts to re-establish a file connection 808 to node S through node E.

Finally, FIG. 8 shows that when the attempt to re-establish a file connection 808 is successful, node C re-sends 810 the same DDS_LOAD request to node E that was originally sent to node F. If node E responds with DDS_OK, node C will respond to node A with DDS_OK.
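The recursive ascent behavior of method 800 can be sketched as a depth-first search over downstream routes. The topology below is a hypothetical reading of FIG. 7 (only the A to C, C to F, F to S and C to E relationships are taken from the text; the E to S link is assumed), the function name load is illustrative, and the sketch assumes the route graph has no cycles.

```python
DDS_OK = "DDS_OK"
DDS_BAD_LINK = "DDS_BAD_LINK"


def load(node, routes, failed_links, origin="S"):
    """Try each downstream route in order; report DDS_BAD_LINK upstream
    only after every route available at this level has failed."""
    if node == origin:
        return DDS_OK
    for downstream in routes.get(node, []):
        if (node, downstream) in failed_links:
            continue  # link is down: look for an alternate route (step 804)
        if load(downstream, routes, failed_links, origin) == DDS_OK:
            return DDS_OK  # request re-sent successfully (steps 808, 810)
    return DDS_BAD_LINK  # no working route from this node (step 806)


# Hypothetical topology loosely modeled on FIG. 7.
ROUTES = {"A": ["C"], "C": ["F", "E"], "F": ["S"], "E": ["S"]}
```

With only the F to S link failed, node F reports DDS_BAD_LINK to node C, node C succeeds through node E, and node A never sees the failure.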

One of skill in the art will appreciate that once node C originally dispatched its DDS_LOAD request (which it had generated in response to having received a DDS_LOAD request from node A), node C was willing to wait for node F to respond because node C's fast pings kept reassuring it that node F and the link to it were both operational.

One skilled in the art will appreciate that, for this and other processes and methods disclosed herein, the functions performed in the processes and methods may be implemented in differing order. Furthermore, the outlined steps and operations are only provided as examples, and some of the steps and operations may be optional, combined into fewer steps and operations, or expanded into additional steps and operations without detracting from the essence of the disclosed embodiments.

4. Atomic Flush Transactions

For all DDS network requests except DDS_FLUSH, a DDS channel “steps forward” one request-response cycle at a time. At any point in time, the external view of a channel's state transitions from its state before a request is processed to its state after the request is successfully processed in an atomic manner. The external view (as opposed to internal view which is what the code processing requests “sees”) of a channel's state will never reflect a partially completed network request. It will reflect the channel's state either before or after a request has been processed, and never anything in-between. A channel may be required to engage with its downstream channel, but, if so, this communication constitutes a separate request-response transaction.

However, to maintain the highest levels of file consistency whereby each DDS request either “happens completely” or “does not happen at all”, DDS flushes are atomic and synchronous from the DDS client terminator site all the way through to the origin filesystem. For large multi-transfer flushes, intermediate nodes may be simultaneously forwarding transfer n while receiving transfer n+1.

An upstream node does not consider a flush successful until it receives a DDS_OK response with the ALL_DATA_RECEIVED response flag also set. Flush data is maintained in shadow extents at upstream nodes until the upstream node receives confirmation that the origin filesystem has successfully received all flush data. For large multi-transfer flushes, each transfer flows through intermediate nodes independently of the other transfers. All nodes store the received flush data in shadow extents. When the server terminator site receives the last flush transfer (DATA_COMPLETE request flag is set), it writes all received flush data to the origin filesystem. After receiving confirmation that the write was successful, the server terminator site replaces all extents that have shadow extents with their shadows and then dispatches a DDS_OK response with the ALL_DATA_RECEIVED response flag also set back upstream. As the response propagates through each intermediate node, each node also promotes its shadow extents.

FIG. 9 is a flowchart illustrating an example of a method 900 of atomic flush transactions. In the method 900 DDS flushes are atomic and synchronous from the client terminator site to the server terminator site.

FIG. 9 shows that the method 900 can include receiving 902 a flush request from an upstream node. I.e., modified file data/metadata is received 902 with the intent that the modified file data/metadata is saved to some form of stable memory, as described below.

FIG. 9 additionally shows that the method 900 can include storing 904 the received flushed data in shadow extents. I.e., the data received 902 in the flush request from the upstream node is stored in shadow extents.

FIG. 9 also shows that the method 900 can include determining 906 if this site is the server terminator site.

When this site is the server terminator site, FIG. 9 shows that the method 900 can include determining 908 if all batches of the flush (multi-transfer flushes have more than one) have been received and the DATA_COMPLETE request flag was set in one of the requests. (Note that it is not uncommon for network requests to be received and/or processed out of sequence with respect to when they were dispatched by the upstream node.)

When all flush data has been received and the DATA_COMPLETE request flag is set, FIG. 9 shows that the method 900 can include determining 910 if synchronous writes are enabled for the file identified in the flush request and, when synchronous writes are enabled, writing 912 all received flush data to the origin filesystem.

FIG. 9 then shows that the method 900 can include determining 914 whether the filesystem write operation completed successfully. When a write error occurs, FIG. 9 shows that the method 900 can include responding 916 to the flush request with a status that conveys the write error code.

When a write error does not occur or synchronous writes are not enabled for the file identified in the flush request, FIG. 9 shows that the method 900 can include setting 918 the ALL_DATA_RECEIVED response flag; promoting 920 the flushed data in the shadow extents; and responding 922 to the received flush request with a status code of DDS_OK. Furthermore, when the DATA_COMPLETE request flag is not set, FIG. 9 shows that the method 900 can include simply responding 922 to the received flush request with a status code of DDS_OK.

When this site is not the server terminator site, FIG. 9 shows that the method 900 can include forwarding 924 the flush request to a downstream node. I.e., the site forwards 924 the flush request to ensure that the flush request continues until it reaches the server terminator site.

FIG. 9 further shows that the method 900 can include receiving 926 a response to the flush request from the downstream node. I.e., because the flush request was forwarded 924, the site waits for confirmation that the downstream site (and all sites to which the downstream site forwarded the request) has responded to the flush request.

FIG. 9 further shows that the method 900 can include determining 928 if the response status is DDS_OK. I.e., the site determines 928 whether the downstream site (and all sites to which the downstream site forwarded the request) has successfully completed the flush request.

When the response status is not DDS_OK (the flush request was not successfully processed downstream), FIG. 9 further shows that the method 900 can include decrementing 930 a retry count and, if the count is greater than zero, re-forwarding 924 the flush request along the same route as the previous request. However, if the retry count equals zero, FIG. 9 further shows that the method 900 can include declaring 932 a network failure, which will initiate a search for an alternate route to the server terminator site.

When the response status is DDS_OK, FIG. 9 further shows that the method 900 can include determining 934 if the ALL_DATA_RECEIVED response flag is set, indicating that the server terminator site has received and accepted all flush data.

When the ALL_DATA_RECEIVED response flag is set, FIG. 9 further shows that the method 900 can include promoting 920 the flushed data in the shadow extents. All extents that have shadow extents are replaced with their shadows. I.e., the changes are made permanent and the data is no longer stored within a shadow extent. Any further access of the file will receive the updated file, rather than the pre-update file.

FIG. 9 additionally shows that the method 900 can include responding 922 to the received flush request with a code of DDS_OK. As the response propagates through each DDS node, each node also promotes its shadow extents if the ALL_DATA_RECEIVED response flag is also set. This ensures that each node is working from the same concurrent data. The response, therefore, propagates upstream until the originating node receives it, thereby becoming aware that the flush request has been completed successfully.
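The main path of method 900 can be illustrated with a simplified chain of nodes. This sketch makes several assumptions that are not in the text: batches arrive in order, synchronous writes are always enabled, the write to the origin never fails, and retries (steps 930 and 932) are omitted. The class name Node and the dictionary standing in for the origin filesystem are hypothetical.

```python
class Node:
    def __init__(self, name, downstream=None, origin=None):
        self.name = name
        self.downstream = downstream  # None at the server terminator site
        self.origin = origin          # dict standing in for the origin filesystem
        self.extents = {}             # promoted ("real") extents
        self.shadow = {}              # shadow extents for the flush in progress

    def flush(self, batch, data_complete):
        """Process one flush transfer; return (status, all_data_received)."""
        self.shadow.update(batch)                    # step 904
        if self.downstream is None:                  # step 906: server terminator
            if not data_complete:                    # step 908
                return "DDS_OK", False               # step 922
            self.origin.update(self.shadow)          # step 912 (assumed to succeed)
            self._promote()                          # step 920
            return "DDS_OK", True                    # steps 918 and 922
        status, all_received = self.downstream.flush(batch, data_complete)  # 924, 926
        if status == "DDS_OK" and all_received:      # steps 928 and 934
            self._promote()                          # step 920
        return status, all_received                  # step 922

    def _promote(self):
        self.extents.update(self.shadow)
        self.shadow = {}
```

Note that an intermediate node promotes its shadows only on the way back upstream, after the server terminator site has confirmed receipt of all data.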

5. Distributed Consistency Mechanism

DDS's consistency mechanism is woven into the DDS Protocol. Every DDS file access request provides an indication of whether the client intends to modify the returned file data or just read it.

DDS implements an internal rule: a DDS client site must first inform its downstream DDS site before it performs a new type of activity (read/write) on a file. So, for example, a client node that has previously fetched a complete file for reading cannot begin writing without first informing the downstream site that it intends to begin writing. This allows the server-side to detect the onset of a concurrent write sharing (CWS) condition and take whatever steps are necessary to maintain cache consistency before the client-side actually performs the write operation.
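Because every site announces a new type of activity before performing it, the server-side can detect CWS with simple bookkeeping. The following sketch uses hypothetical names (FileSharingState, announce) and models only detection, not the consistency actions that follow.

```python
class FileSharingState:
    def __init__(self):
        self.readers = set()  # sites that have announced read activity
        self.writers = set()  # sites that have announced write activity

    def announce(self, site, mode):
        """Record a site's intent ('read' or 'write').

        Returns True when a concurrent write sharing condition exists:
        more than one site active on the file and at least one writing.
        """
        (self.writers if mode == "write" else self.readers).add(site)
        active = self.readers | self.writers
        return len(active) > 1 and len(self.writers) >= 1
```

Two readers never trigger the condition; it arises at the moment one of several active sites announces its intent to write, which is before that site actually performs the write.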

The consistency mechanism ensures that, at any instant, only a single DDS site is modifying a file. Therefore, at the moment when a site with dirty data (modified data) becomes partitioned from its downstream server-side counterpart, all other sites are guaranteed to not have any file modifications. The isolated site holds the most recent file modifications and all portals still connected to the server terminator site provide a consistent view of the file “just before” the latest (and now isolated) file write. The isolated site can flush the file modifications to any DDS site providing connectivity to the source file, including the server terminator site itself.

6. “Versioned” File Modifications

6.1. Validating a Cached File Image

Modern computer systems typically employ internal clocks with nanosecond or microsecond resolutions. When DDS code executes on any computer system other than the origin server, the clock used by remote DDS nodes cannot be synchronized with the clock used by the origin filesystem's code to the level of precision required to support standard filesystem operations. So, filesystem timestamps may only be set at the origin server site.

DDS file access responses always include the target file's attributes, which are cached and stored in association with the file's data. The file's last modification timestamp, a file attribute element, is used as a version number for cached images. At DDS sites the cached image of a file's last modification timestamp may be referred to as the file's version number. The two are the same. The differing nomenclature relates to how the attribute is interpreted and used at DDS sites.

Upstream DDS sites use a file's timestamp (a file attribute), the timestamp arrival time and a delta time to generate and maintain projected timestamps. Projected timestamps, which enable DDS upstream sites to operate autonomously (the main point of file caching), are temporary timestamps that are replaced with (upgraded to) genuine filesystem timestamps whenever a server terminator site accesses file data in the origin filesystem.

The elements of a projected timestamp are:

-   timestamp: a cached image of a file timestamp (for Unix-like systems: atime, time of last access; mtime, time of last modification; ctime, time of last status change),
-   timestamp arrival time: the time at which a particular timestamp is received in a response from a DDS downstream site, and
-   delta time: the difference between the current time and the timestamp arrival time.

Filesystems native to Unix-like systems typically maintain three timestamps for each file:

-   a_time: time of last access,
-   m_time: time of last modification, and
-   c_time: time of last status change.

DDS upstream sites generate projected timestamps for each of these filesystem timestamps and therefore maintain three arrival times and three delta times: a_arrival/a_delta, m_arrival/m_delta and c_arrival/c_delta. The description of projected timestamps in the remainder of this document focuses on the generation of projected timestamps for the time of last modification. However, similar methods are used to generate projected timestamps for a_time and c_time.
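One plausible way to combine the three elements is sketched below. The formula (origin timestamp plus locally measured delta) is an assumption inferred from the element definitions and is not stated explicitly in the text; the class and method names are hypothetical, and times are plain floating-point seconds read from the local clock.

```python
from dataclasses import dataclass


@dataclass
class ProjectedTimestamp:
    timestamp: float    # cached image of the origin timestamp (e.g. m_time)
    arrival: float      # local clock reading when that timestamp arrived
    delta: float = 0.0  # local time elapsed since arrival at the last update

    def touch(self, now):
        """Update the delta when a client request is serviced locally."""
        self.delta = now - self.arrival

    def projected(self):
        # Assumption: advance the origin timestamp by the local delta, so
        # only elapsed local time is used, never the local absolute clock.
        return self.timestamp + self.delta

    def upgrade(self, genuine_timestamp, now):
        """Replace the projection with a genuine filesystem timestamp."""
        self.timestamp, self.arrival, self.delta = genuine_timestamp, now, 0.0
```

Under this reading the upstream clock only needs to measure intervals accurately; it never needs to agree with the origin server's clock on absolute time.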

When a response to a flush request with status DDS_OK is received and processed, the last modification timestamp contained in the response becomes the version number, the m_arrival time is set to the current time, and the m_delta time is set to zero. Whenever a response to any other type of DDS file access request is received, the version number is compared with the cached one. When they differ, the client knows the file has been modified at some other site and its image is no longer valid.

FIG. 10 is a flowchart illustrating an example of a method 1000 of validating a cached file image when a response received at an upstream site contains the same version number as the one associated with the cached file image; and invalidating the cached file image, storing the new version number, setting the m_arrival time to the host system's current time, and resetting the m_delta time to zero when a response contains a different version number than the one associated with the cached file image.

FIG. 10 moreover shows that the method 1000 can include dispatching 1002 a DDS file access request to the downstream DDS site. I.e., the node sends a request to read, write, save, etc. a file from a downstream site.

FIG. 10 also shows that the method 1000 can include receiving 1004 a response to the DDS downstream request. I.e., the node receives any response from the downstream site, regardless of whether the access was successful or not.

FIG. 10 further shows that the method 1000 can include determining 1006 if the DDS request was processed without any errors and the response status is therefore DDS_OK. FIG. 10 also shows that, when there is an error, the request is repeated. However, what is not depicted is that after a few unsuccessful re-attempts the upstream site will declare a network failure and may begin searching for an alternate route to the DDS server terminator site.

FIG. 10 further shows that the method 1000 can include determining 1008 whether the received version number is the same as the version number associated with the cached file image. Note that the response to any successful DDS file access request (response status is DDS_OK) conveys the file's attributes, which include the version number (m_time, the last modification timestamp).

FIG. 10 also shows that the method 1000 can include continuing to use 1010 the current cached file image (revalidating the current file image) when the received version number is the same as the version number associated with the cached file image. I.e., because this site is using the current version, the current file image is revalidated and continues to be used.

FIG. 10 additionally shows that when the received version number is not the same as the version number associated with the cached file image, the method 1000 can include storing 1012 the new version number, setting the m_arrival time to the host system's current time, and resetting the m_delta time to zero.

FIG. 10 further shows that the method 1000 can include determining 1014 whether the response is a flush response. When the response is not a flush response, the method 1000 can include discarding 1016 the current cached file image. A DDS flush request flows through to the origin file server in a single atomic operation and the response returns the last modification timestamp as the file's version number. So, a file's version number always changes when a flush response with a status of DDS_OK is received. (When the version number changes on any other response, the file has been modified at some other site and the cache image at this site is therefore not current and must be discarded.)

FIG. 10 further shows that the method 1000 can include determining 1018 if the response has the CACHING_ENABLED flag set and caching 1020 any new response data and metadata when the CACHING_ENABLED flag is set.
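The decision steps of method 1000 can be condensed into a single function, sketched below. The dictionary layouts for the cache entry and the response, the key names, and the "retry" and "ok" return values are illustrative assumptions; retry limits and network failure handling are omitted.

```python
def process_response(cache, response, now):
    """Apply method 1000 to one response.

    cache: dict with keys 'version', 'm_arrival', 'm_delta', 'image'.
    response: dict with 'status' and 'version', plus optional 'is_flush',
    'caching_enabled' and 'data' (all key names are hypothetical).
    """
    if response["status"] != "DDS_OK":            # step 1006
        return "retry"
    if response["version"] != cache["version"]:   # step 1008
        # Step 1012: adopt the new version and restart the projection.
        cache.update(version=response["version"], m_arrival=now, m_delta=0.0)
        if not response.get("is_flush", False):   # step 1014
            cache["image"] = None                 # step 1016: image is stale
    # Steps 1018 and 1020: cache new data only when the response allows it.
    if response.get("caching_enabled") and response.get("data") is not None:
        cache["image"] = response["data"]
    return "ok"
```

A matching version number leaves the cache entry untouched, which is the revalidation case of step 1010.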

6.2. Flushing a Modified File Image

Upstream DDS sites provide better (faster) write performance when client file modifications are captured at the site and not immediately flushed downstream. “Collecting” many client file modifications before sending a “batch flush” downstream employs the underlying network infrastructure far more efficiently and provides a more responsive file access service to the client. However, when a DDS client terminator site acknowledges a client write request before the new data has been successfully flushed downstream, the possibility arises that a future network failure could cause the new data to be lost.

DDS provides a distributed file service and is therefore always balancing performance against filesystem integrity. Performance is increased when DDS client terminator sites operate autonomously, but the risk of losing file modifications is also increased. Both the administrator and the user can set or adjust policies affecting this balancing act.

The most risk-averse policy instructs client terminator sites to immediately forward file modifications on to the server terminator site and to not respond to a client write request until the origin server acknowledges the successful receipt of the new data. This mode, referred to as synchronous writes, traverses the full network path from client application to origin server on every write operation.

Riskier, performance-oriented policies allow file modifications to be aggregated and batch flushed. This mode is often referred to as delayed writes. Generally, a timer, controlled by policy, initiates batch flushes in this mode of operation.

When modified file data is flushed, the file data flowing downstream includes the version number. It will be the same as the version number at the DDS server terminator site and all intermediate DDS nodes unless some previous network partition event prevented consistency control messages from being delivered.

Once a DDS client site detects that it is isolated from its server-side companion (it employs fast pings to detect this quickly), it immediately invokes procedures to re-establish communication with a DDS node still “connected” to the server terminator site. Once the isolated, and now paranoid, node reconnects, it usually begins immediately flushing all dirty file images.

Each flush request is tagged with the cached file image's version number. Whenever a DDS node processes a flush request, it compares the version number in the flush request with the version number associated with its cached image of the file identified in the flush request. When there is a mismatch the flush is rejected and sent back to the client with an error code of OUT_OF_SEQUENCE. This error code is then returned to the client application, which will have to resolve this issue.
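The version check performed by a node that receives a flush can be sketched as follows. The dictionary layouts and the function name receive_flush are hypothetical; the sketch accepts matching flush data into shadow storage, consistent with the shadow extents described earlier, and leaves forwarding to the next node out of scope.

```python
def receive_flush(node, request):
    """Accept or reject one flush request at a downstream node.

    node: dict with 'version' (version of the cached image) and 'shadow'.
    request: dict with 'version' (sender's cached version) and 'data'.
    """
    if request["version"] != node["version"]:
        # The sender modified a stale image: reject without touching state.
        return "OUT_OF_SEQUENCE"
    node["shadow"].update(request["data"])
    return "DDS_OK"
```

A rejected flush changes nothing at the receiving node, so the file image held there remains consistent while the client application decides how to resolve the conflict.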

Version numbers will usually match and the incoming file modifications will be accepted. However, when a substantial amount of time passes while the client node is processing file requests using only cached file image data (no communication with the downstream site), the likelihood of a version mismatch increases. Of course, when a DDS site processes file requests independently of its downstream site, it is relying upon DDS's consistency callback mechanism for immediate notification when a concurrent write sharing condition arises and it is relying on fast pings to continually reassure itself that the callback path is operational.

FIG. 11 is a flowchart illustrating an example of a method 1100 of flushing modified file data downstream towards the DDS server terminator site; revalidating the modified file data at each DDS site; and, when all flush data has arrived at the server terminator site, possibly writing all file modifications to the origin filesystem and then fetching the file's last modification timestamp.

FIG. 11 shows that the method 1100 can include receiving and servicing1102 a request from a client application and updating the appropriatedelta time, which is m_delta for write requests, a_delta for readrequests and c_delta for requests that modify the file's attributes.

FIG. 11 also shows that the method 1100 can include determining 1104 if it is time to flush file modifications downstream. If it is not time to flush the file modifications downstream, the site continues to service client file access requests. When synchronous writes are being used for the file identified in the request, it will always be time to flush the modifications downstream. For delayed writes, flushes may be initiated when a delay timer expires. The delay timer may be reset on every write request or every file access request. So, for example, a client terminator site might initiate a batch flush operation 15 seconds after receiving the last of many write requests.
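The decision made at step 1104 might look like the following C sketch. The structure flush_state, its fields, and the function dds_time_to_flush are hypothetical names chosen for illustration; the 15 second delay is the example value from the paragraph above, and the sketch assumes the timer is reset on every write request (the text also allows resetting on every file access request).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define DDS_FLUSH_DELAY_SECS 15   /* example delay taken from the text */

/* Hypothetical per-file flush bookkeeping at a client terminator site. */
typedef struct {
    bool   synchronous;   /* synchronous write mode is in effect for this file */
    bool   dirty;         /* unflushed modifications exist */
    time_t last_write;    /* local time of the most recent write request */
} flush_state;

/* Step 1104: decide whether modifications should be flushed downstream now. */
bool dds_time_to_flush(const flush_state *fs, time_t now)
{
    assert(fs != NULL);
    if (!fs->dirty)
        return false;                 /* nothing to flush */
    if (fs->synchronous)
        return true;                  /* synchronous writes always flush */
    return difftime(now, fs->last_write) >= DDS_FLUSH_DELAY_SECS;
}
```

Because last_write is updated on each write, a burst of writes keeps postponing the flush until the file has been quiet for the full delay.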

FIG. 11 further shows that the method 1100 can include flushing 1106 all file modifications downstream. The file's version number accompanies the flush data.

FIG. 11 additionally shows that the method 1100 can include a downstream DDS node receiving 1108 the flush request and comparing the received version number with the version number associated with the downstream node's cached file image.

FIG. 11 moreover shows that the method 1100 can include determining 1110 whether the version numbers differ.

FIG. 11 also shows that the method 1100 can include the downstream DDS node responding 1112 OUT_OF_SEQUENCE to the upstream node if the version numbers differ.

When the version numbers are the same, FIG. 11 then shows that the method 1100 can include storing 1114 the received flush data in shadow extents.

FIG. 11 further shows that the method 1100 can include determining 1116 if the downstream DDS node is the server terminator site. If the downstream DDS node is not the server terminator site, then steps 1106-1116 are repeated until the server terminator site is reached.

FIG. 11 also shows that when the flush request is processed at the server terminator site, the method 1100 can include determining 1118 whether all requests in this flush operation have been received and whether one of the requests had the DATA_COMPLETE request flag set.

When not all requests in this flush operation have been received with one of the requests having the DATA_COMPLETE request flag set, FIG. 11 shows that the method 1100 can include responding 1120 to the upstream node with a status of DDS_OK.

FIG. 11 additionally shows that the method 1100 can include determining 1122 whether synchronous write mode is being used for the file identified in the request. When synchronous write mode is not being used, FIG. 11 shows that the method 1100 can include responding 1120 to the upstream node with a status of DDS_OK.

When synchronous write mode is being used, FIG. 11 shows that the method 1100 can include writing 1124 all file modifications to the underlying origin filesystem.

FIG. 11 further shows that the method 1100 can include determining 1126 whether the write to the origin filesystem was successful. When the write to the origin filesystem is not successful, FIG. 11 shows that the method 1100 can include responding 1128 to the upstream node with a status that indicates the type of error that occurred.

When the write to the origin filesystem is successful, FIG. 11 shows that the method 1100 can include setting 1130 the ALL_DATA_RECEIVED response flag and replacing all extents that have shadow extents with their respective shadow extents.

FIG. 11 then shows that the method 1100 can include fetching the file's last modification timestamp from the origin filesystem. This timestamp, which conveys the time (according to the origin filesystem clock) that the write 1124 was performed, will be interpreted at upstream sites as the file's version number.

FIG. 11 finally shows that the method 1100 can include responding 1120 to the upstream node with a status of DDS_OK.

6.3. Generating Timestamps at Client Terminator Sites

When a client system (a workstation, for example) receives a response to a file write request, the response includes the file's attributes, an element of which is the file's last modification timestamp. For synchronous writes, this timestamp will be the correct timestamp generated by the origin filesystem when it received the new file modification. However, when delayed writes are employed for faster performance and the DDS client terminator site is operating autonomously, the DDS instance executing at the client terminator site generates the response's last modification timestamp as follows:

last modification timestamp=version number+m_delta time;

where:

m_delta time=current time−m_arrival time.

The version number is the origin server's last modification timestamp and only the origin server updates it. Upstream sites use the last modification timestamp as the file's version number. Every file modification performed by the origin server creates a new file version.
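The projection formula above can be written as a short C sketch. The structure proj_state and the function dds_projected_mtime are hypothetical names; the sketch also assumes time_t counts whole seconds (as on POSIX systems) and that m_arrival was recorded from the same local clock that later supplies the current time.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical per-file state kept at a client terminator site. */
typedef struct {
    time_t version;     /* origin server's last modification timestamp */
    time_t m_arrival;   /* local clock reading when that version arrived here */
} proj_state;

/* last modification timestamp = version number + m_delta time,
 * where m_delta time = current time - m_arrival time. */
time_t dds_projected_mtime(const proj_state *ps, time_t now)
{
    assert(ps != NULL);
    time_t m_delta = now - ps->m_arrival;
    return ps->version + m_delta;
}
```

Only the difference between two readings of the local clock enters the result, so an offset between the client terminator's clock and the origin server's clock does not affect the projected timestamp.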

The timestamp generated by the client terminator is a temporary timestamp that is accurate enough to enable the client application to believe it is accessing the most current version of the file (which, in fact, it is). The timestamp monotonically increases by a reasonable amount on every file modification request and it is periodically resynchronized (whenever dirty file data is flushed downstream) with the origin server's timestamp. This behavior helps to maintain the illusion that the DDS service is provided by a single local filesystem “within” the computer where the client application is executing.

Similar procedures may also be used to project other temporary filesystem timestamps such as a file's last access timestamp and last change timestamp.

7. Path Signatures

The path signature, included in some DDS responses, defines the current route back to the origin server. Path signatures have the following structure:

typedef struct dds_path_signature {
    int n_hops;               // number of DDS hops to server site
    long node_signature[16];  // 4 byte node identifier of a DDS node
} DDS_PATH_SIGNATURE;

Whenever an upstream DDS node receives a successful connection response (DDS_OK status and the INITIAL_FILE_CONNECTION response flag is set), the path signature contained within the response is copied into the channel.

During the course of processing a request from an upstream site currently not connected to the target file, the downstream site will establish a connection to the upstream site and then include in its request response a path signature constructed by adding its signature to the channel's path signature.
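Constructing the response's path signature from the channel's path signature might be sketched in C as below. The function dds_path_append and its return convention are hypothetical; the sketch further assumes that node signatures are stored in order from the server terminator site outward, so that a node appends its own signature at index n_hops, and that a path holds at most the 16 entries the structure provides.

```c
#include <assert.h>
#include <stddef.h>

typedef struct dds_path_signature {
    int  n_hops;               /* number of DDS hops to server site */
    long node_signature[16];   /* node identifiers along the path */
} DDS_PATH_SIGNATURE;

/* Copy the channel's path signature into the response and add this node's
 * signature. Returns 0 on success, -1 if the channel's path is full or invalid. */
int dds_path_append(DDS_PATH_SIGNATURE *resp, const DDS_PATH_SIGNATURE *chan,
                    long self)
{
    assert(resp != NULL && chan != NULL);
    if (chan->n_hops < 0 || chan->n_hops >= 16)
        return -1;
    *resp = *chan;                               /* channel -> response */
    resp->node_signature[resp->n_hops] = self;   /* add this node */
    resp->n_hops++;
    return 0;
}
```

The channel's own copy is left unchanged, so it continues to describe the path from this node down to the origin server.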

At any instant, a channel's path signature reflects the last successful path used to access file data.

When a network failure occurs, DDS initiates procedures to re-establish a connection to a file. A client-side node will successively direct a reconnection request down each of its alternate paths to the origin server, stopping as soon as one of the reconnection attempts is successful. After trying all paths without successfully reconnecting to the file, the client-side node will return the DDS_BAD_LINK error code to the client that was accessing the file when the failure was discovered.

A reconnection request, which is a DDS request with the DDS_RECONNECT flag set, contains the channel's path signature. The server-side node receiving and processing the request uses the path signature to recognize the client as a current client and to reconcile its consistency control data structures. So, for example, when a writing client reconnects, the server-side node does not “see” a second writer and declare a concurrent write sharing condition. It “sees” a client that has been modifying a file now attempting to access the file through a new path.

A reconnection request propagates downstream until a node recognizes that its signature is contained in the path signature. If this node is more than one level lower in the cache hierarchy, the upstream nodes along the previous path must be informed so that the old path's consistency control structures can be reconciled.

A reconnection request may also be issued when there is no network failure. An upstream node, deciding to re-balance its downstream traffic, may issue a reconnection request at any time.

7.1. Path Signature Upstream Propagation

FIG. 12 is a flowchart illustrating an example of a method 1200 employed by a DDS node to load a path signature when a connection is established; and for that node to add its signature to the path signature that it sends upstream whenever an upstream node establishes a file connection.

FIG. 12 illustrates that the method 1200 can include receiving 1202 a DDS request. I.e., a DDS request is received from an upstream node.

FIG. 12 also shows that the method 1200 can include acquiring 1204 the channel for the file identified in the request. When the channel does not already exist, a new channel is created and assigned to the identified file.

FIG. 12 further illustrates that the method 1200 can include determining 1206 if it is possible to service the request without communicating with the downstream node. This would be the case when all required file data is cached and valid at the site, and the channel at this site is already “connected” to a downstream site or this site is operating in disconnected mode.

When a downstream communication is required, FIG. 12 also shows that the method 1200 can include dispatching 1208 a request to a downstream site and then receiving 1210 a response from the downstream site.

FIG. 12 then shows that the method 1200 can include determining 1212 if the response contained an indication that the downstream site just established a file connection to this site.

When the downstream site indicates that it has just established a file connection to this site, FIG. 12 shows that the method 1200 can include copying 1214 the path signature from the response to the channel.

When 1206 determines that a downstream communication is not required, FIG. 12 shows that the method 1200 can include servicing 1216 the request.

Then FIG. 12 illustrates that the method 1200 can include determining 1218 if the upstream site that sent the request now being processed is establishing an initial file connection.

When the upstream site is establishing an initial file connection, FIG. 12 shows that the method 1200 can include copying 1220 the path signature from the channel to the response, adding this node's signature to the response's path signature, and then setting the INITIAL_FILE_CONNECTION flag in the response.

Finally, FIG. 12 shows that the method 1200 can include dispatching 1222 a response back to the upstream DDS client.

7.2. Path Signature Based Reconnection

FIG. 13 is a flowchart illustrating an example of a method 1300 employed to reconnect an upstream DDS node.

FIG. 13 illustrates that the method 1300 can include receiving 1302 a DDS request from an upstream site.

FIG. 13 also shows that the method 1300 can include acquiring 1304 the channel for the file identified in the request. When the channel does not already exist, a new channel is created and assigned to the identified file.

FIG. 13 further illustrates that the method 1300 can include determining 1306 if the DDS_RECONNECT flag is set in the request. When the DDS_RECONNECT flag is not set, the procedure for reconciling the upstream site structures at this site (and possibly sites above) is bypassed.

FIG. 13 then shows that when the DDS_RECONNECT flag is set the method 1300 can include determining 1308 if the path signature contained in the request identifies this node as a member of the “old” path. When the node is not a member of the “old” path, FIG. 13 illustrates that the method 1300 can include resetting 1310 the channel's attributes valid (ATTRS_VALID) flag to force this node to “go downstream” to fetch valid attributes.

FIG. 13 shows that when this node is a member of the “old” path the method 1300 can include sending 1312 a DDS RECALL or INVALIDATE message to the “old” upstream site identified in the path signature.

FIG. 13 also shows that the method 1300 can include determining 1314 if a response to the DDS RECALL or INVALIDATE message is received.

FIG. 13 shows that when a response is received the method 1300 can include updating 1316 the upstream site structure of the “old” upstream site identified in the path signature.

FIG. 13 further illustrates that when a response is not received the method 1300 can include declaring 1318 the “old” upstream site identified in the path signature to be OFFLINE and recording this status by setting the OFFLINE flag in the upstream site structure of the “old” upstream site.

FIG. 13 finally shows that the method 1300 can include servicing 1320 the DDS request and then dispatching a response back to the upstream DDS node.

Note that this procedure may be repeated at multiple intermediate sites until a site on the “old” path is encountered. Resetting the ATTRS_VALID flag (step 1310 of FIG. 13) continually pushes a DDS_RECONNECT request further downstream until a site currently connected to the origin server is encountered. DDS_RECONNECT requests contain the full path signature of the upstream client node requesting the reconnection.

8. Any Port in a Storm

The DDS consistency mechanism ensures that all DDS portals are equivalent. The any port in a storm feature allows a DDS client node to switch to a new downstream node at any time. So, whenever a client node feels isolated it can elect to find a new partner.

The new partner may be at the same hierarchical level as the partitioned site, or it may be at any level closer to the server terminator site. When the new partner is at the same hierarchical level as the partitioned site, it must, in addition to responding to the request, send a message downstream informing that site that an upstream site has switched partners. This allows the downstream site to revoke whatever permissions it had granted the isolated site (isolated from the client attempting to reconnect and possibly also isolated from this site) and grant permissions along this new path.

Note that the any port in a storm feature, which has a downstream orientation, is only possible because of the consistency feature that has an upstream orientation. These two features work together to provide a highly consistent file access service layered on top of inherently unreliable networks such as the Internet.

- The any port in a storm feature provides the resiliency required for extremely reliable communications.
- The DDS consistency mechanism is completely dependent on extremely reliable communications.
- The any port in a storm feature is completely dependent on the DDS consistency mechanism.

Neither feature, the DDS consistency mechanism or any port in a storm, can stand alone. But, when intertwined, they provide the solid foundation required for providing very strong consistency guarantees over highly distributed, unreliable networks.

9. Filehandles are Forever

DDS filehandles, patterned after NFS filehandles, are permanent filehandles. The file server generates an opaque file identifier during the processing of a lookup request and passes it back to a client. The client then uses this filehandle as a reference point in future read/write requests.

There is no timeout on the validity of a DDS (or NFS) filehandle. It is valid forever. The client system may present a filehandle received ten years ago (and not used since) and the file server must connect to the same file or return the error STALE_FILEHANDLE if the file no longer exists.

U.S. patent application Ser. No. 12/558,482 (Nomadic File Systems) discloses the construction of globally unique permanent filehandles.

- However, a remote NFS client accessing the same file would receive a file handle that uniquely identified the file forever. An NFS client may present a file handle that it received ten years ago and hasn't used since back to the file server and that server must either establish a connection with the original file or respond with an error indicating that the file handle is no longer valid. (A file handle is another type of object ID.)
- A method commonly used by Unix based NFS file servers to create a permanent file id is to concatenate two 32 bit numbers, the inode number and the inode generation number, to create a 64 bit file id. Since each time an inode is assigned to a new file its generation number is incremented, an inode would have to be re-used over 4 billion times before a file id of this type could repeat. These 64 bit file ids are essentially good forever.
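The 64 bit file id described in the passage above can be illustrated with a few lines of C. The function names are hypothetical, and placing the inode number in the upper 32 bits is an assumption made for this sketch; the passage states only that the two 32 bit numbers are concatenated.

```c
#include <stdint.h>

/* Concatenate the inode number and the inode generation number into a
 * 64 bit file id (inode number in the upper half, by assumption). */
uint64_t make_file_id(uint32_t inode, uint32_t generation)
{
    return ((uint64_t)inode << 32) | (uint64_t)generation;
}

/* Recover the two halves of a file id. */
uint32_t file_id_inode(uint64_t id)      { return (uint32_t)(id >> 32); }
uint32_t file_id_generation(uint64_t id) { return (uint32_t)(id & 0xFFFFFFFFu); }
```

Two files that occupy the same inode at different times receive different generation numbers and therefore different file ids, which is what allows a server to recognize a stale handle.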

Permanent filehandles facilitate network error recovery operations by reducing to a bare minimum the amount of distributed state required for a disconnected DDS client site to successfully reconnect with the origin server. A DDS client needs nothing more than a filehandle to reconnect to a file. The DDS client does not need to know what directory, what filesystem or even what file server.

Server-side Operations

When a concurrent write sharing condition arises, DDS server nodes must RECALL a modified upstream image projection (if there is one, and there can be only one) or INVALIDATE all upstream image projections supporting accelerated read operations.

An INVALIDATE or RECALL operation proceeds as follows:

- A concurrent write sharing condition (multiple clients active on a file and at least one of them is writing) is detected at the onset of processing a file access request.
- Notifications are prepared for each upstream site except the one that sent the request that precipitated this CWS condition. Each note contains two elements: a) a file identifier and b) an opcode (RECALL or INVALIDATE).
- Each notification is dispatched to an upstream site using a self addressed stamped envelope (SASE) that the upstream site had previously sent to the downstream site.
- The upstream site responds by dispatching a DDS_FLUSH request. At a minimum, this request contains a flag indicating that this request is an acknowledgement to the RECALL/INVALIDATE notification. In the case of a RECALL, the request also contains all file modifications.
- After all upstream sites have responded, the downstream site proceeds with processing the original request.
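The notification preparation step in the list above might be sketched as follows. All type, field, and function names here are hypothetical. The sketch assumes each upstream site structure records whether that site holds the modified image, so that site receives a RECALL and every other site receives an INVALIDATE; the site_id field stands in for the SASE addressing and is not one of the two elements carried inside the note.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { OP_RECALL, OP_INVALIDATE } dds_opcode;

/* Hypothetical view of an upstream site structure. */
typedef struct {
    int  site_id;
    bool holds_modified_image;   /* at most one site may have this set */
} upstream_site;

/* A note carries a file identifier and an opcode; site_id is its destination. */
typedef struct {
    int           site_id;
    unsigned long file_id;
    dds_opcode    opcode;
} dds_note;

/* Prepare one note per upstream site, skipping the site whose request
 * precipitated the CWS condition. out[] must have room for n notes.
 * Returns the number of notes prepared. */
size_t dds_prepare_notes(const upstream_site *sites, size_t n, int requester_id,
                         unsigned long file_id, dds_note *out)
{
    assert(sites != NULL && out != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (sites[i].site_id == requester_id)
            continue;
        out[count].site_id = sites[i].site_id;
        out[count].file_id = file_id;
        out[count].opcode  = sites[i].holds_modified_image ? OP_RECALL
                                                           : OP_INVALIDATE;
        count++;
    }
    return count;
}
```

The downstream site would then wait for a DDS_FLUSH acknowledgement for each prepared note before processing the original request.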

When an upstream site is partitioned from its downstream site, the upstream site never receives the notification. So, the downstream site never receives confirmation that the upstream site has invalidated its cached image of the file.

When this occurs the downstream site may decide (depending on the currently established policies) to give up on the isolated site and continue servicing other clients. So, the downstream site may declare the upstream site OFFLINE and record that status in the upstream site structure (uss) associated with the upstream site. Then the downstream site, based on the established policy, may do one of the following:

- The downstream site waits for communications to be re-established, receives confirmation from all upstream sites, and finally proceeds with processing the original request.
- The downstream site proceeds with processing the original request. In this case, the isolated site is sidelined so that the server-side site can continue servicing the other client sites. The sidelined site will be dealt with and re-integrated when communications are re-established.

When communications are restored, the upstream site will promptly send a SASE. If the downstream site had chosen to wait it will quickly bounce the SASE back with the same note it had sent earlier.

During a network partition event, all DDS nodes on the server-side of the partition have a primary responsibility to protect and ensure the integrity of all DDS filesystem content. DDS must never lose or mangle data once that data has been successfully written to a DDS node.

The primary directive for DDS server-side nodes is therefore:

Never take any action that can possibly result in the loss or corruption of filesystem data.

So, during a network partition event, DDS server-side nodes act reflexively to ensure filesystem integrity. Then, throughout the partition event, the server-side nodes continue providing file services in conformance with the primary directive. However, these server-side nodes are pledged to filesystem integrity and have no real obligations to any client. DDS server-side nodes will quickly refuse any request that is not consistent with the primary directive (PD).

Client-Side Operations

Although the DDS server-side of a network partition is not dedicated to servicing client systems, the other side of the partition is completely focused on providing the best client service possible given the current circumstances. In particular, DDS client-side nodes are responsible for ensuring that no delayed write data (file modifications that have not yet been flushed to the origin) is ever lost.

The domain policy attributes in effect at each client-side node determine what actions the node performs when a network partition is detected, but a client-side node will generally safeguard its data first and then attempt to reconnect to the server-side and flush all file modifications back to their respective origin servers.

DDS client nodes continuously monitor the health and performance of all downstream paths, and for every path the client node is aware of all alternate paths. The statistics maintained for each path include:

- running average of bytes/second,
- average response time latency, and
- uptime percentage.

When a DDS client node issues a request and fails to receive a response within a timeout window (typically 2 or 3 seconds), the node retransmits the request several times before declaring the downstream node OFFLINE.

When the downstream site is declared OFFLINE, the DDS client node simultaneously tries all alternate paths and selects the path with the best performance.
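Selecting among the alternate paths that responded could be sketched in C as shown below. The description does not define how “best performance” is computed from the maintained statistics, so the ranking used here (highest running average of bytes/second, with lower average latency breaking ties) is purely an assumption, as are the names path_stats and dds_select_path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-path record combining the maintained statistics with
 * the outcome of the most recent reconnection attempt. */
typedef struct {
    double bytes_per_sec;   /* running average of bytes/second */
    double latency_ms;      /* average response time latency */
    double uptime_pct;      /* uptime percentage (not used in this ranking) */
    bool   reachable;       /* did the reconnection attempt succeed? */
} path_stats;

/* Returns the index of the preferred reachable path, or -1 if no
 * alternate path responded. */
int dds_select_path(const path_stats *paths, size_t n)
{
    assert(paths != NULL || n == 0);
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (!paths[i].reachable)
            continue;
        if (best < 0 ||
            paths[i].bytes_per_sec > paths[best].bytes_per_sec ||
            (paths[i].bytes_per_sec == paths[best].bytes_per_sec &&
             paths[i].latency_ms < paths[best].latency_ms)) {
            best = (int)i;
        }
    }
    return best;
}
```

A return value of -1 corresponds to the case in which all alternates have been tried without success.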

After all alternates have been tried, and none have been successful, DDS syncs all modifications to files from the disconnected downstream server to the site's stable memory (hard disk or flash memory). When downstream communications are restored, all file modifications are immediately flushed downstream.

Once all file modifications have been secured in stable memory, DDS either:

- continues attempting to establish communication along the original path and all alternate paths; or
- returns an error indication to the client workstation.

In addition to performing whatever downstream communications are required to support the file access services the upstream site is providing, the upstream site also pings the downstream site once a second (typically). This enables the upstream site to quickly detect a partition event and, depending on the established policies, possibly stop providing access to cached files affected by the partition event.

When the upstream site fails to receive a reply to its ping, it retransmits the request several times before declaring the downstream node OFFLINE.

The upstream site will continue to periodically ping the downstream site. When a response is finally received, the site is declared ONLINE and both sites may begin re-synchronizing.

Site re-synchronization basically consists of re-syncing a series of individual files. And this is a straightforward process with one exception: an out of sequence write (which is explained in the following section).

Out of Sequence Writes

In the case where an upstream site has been modifying the file foo when a partition event occurs, it is possible that:

- a CWS condition can arise, and
- the RECALL message is not delivered, and
- the upstream site modifies its image of foo before the upstream site has detected the partition event via its fast ping polling.

At some point, the downstream site:

- finally gives up on the upstreamer that has not responded to the RECALL notification,
- marks the upstreamer OFFLINE, and
- moves on with providing other clients with access to foo.

Then another client modifies foo.

Finally, the network is fixed and the upstreamer comes back ONLINE. When the upstreamer attempts to flush foo downstream, the downstream site MUST detect that the modification performed at the upstream site did not use the most current version of foo, and the request must be rejected with error code OUT_OF_SEQUENCE.

III. Recovery Procedures and Recovery Routines

Server-side nodes assume a rather passive role during a network failure. These nodes must receive file modifications that are being flushed downstream, but that is also what they do when the network is not broken. The main difference during a network failure is that when a server-side node receives a flush with a path signature indicating an upstream site is reconnecting, it executes a recovery routine to transfer read/write permissions from the isolated upstream site to the site that sent this flush.

Client-side nodes bear almost the entire burden of recovering from network failures because they have service commitments to active clients that must be maintained. The client-side node “closest to” a failed network component will be the node that detects the failure and declares the network partitioned. The node will then execute a recovery procedure.

The recovery procedure will sequentially call upon various recovery routines, which are described in the following section.

Client-Side Recovery Routines

int dds_monitor_ds_paths(CHANNEL *)

This routine is continuously executed by a thread dedicated to monitoring all downstream paths and maintaining performance statistics associated with each path. When a failure occurs, the information gathered by dds_monitor_ds_paths( ) may be referenced to quickly determine the best alternate path.

dds_monitor_ds_paths( ) maintains a path_state mini-database of all downstream paths emanating from this node. This routine may also periodically send its current path_state to a node that maintains and presents a global view of network operations.

int dds_reconnect(CHANNEL *cp, int mx)

This routine re-establishes a connection to the origin server for the file identified by the filehandle contained within the channel (cp->fh). dds_reconnect( ) is optimized for the speed of re-establishing a downstream connection because DDS nodes really don't like to be disconnected. This routine may reference the path_state mini-database to quickly select the most appropriate alternate path for the reconnection attempt.

A dds_reconnect( ) request always contains the issuing site's path signature and it always propagates through to the first node that is common to both the ‘old’ and the ‘new’ paths.

int dds_flush(CHANNEL *cp)

A dds_flush( ) request flows from a DDS client terminator site to a server terminator site as a single atomic network transaction. For large multi-transfer flushes, intermediate nodes may be simultaneously forwarding transfer n while receiving transfer n+1.

DDS flushes are atomic and synchronous. A flush is not considered successful until the client receives the server's OK response. A shadow copy is kept until a node receives an OK response from its immediate downstream site.

For large multi-transfer flushes, each transfer flows through intermediate nodes independently of the other transfers. All nodes store the flush data in shadow extents. When the server terminator site receives the last flush, it moves all shadow extents into the sunlight and dispatches an OK response back upstream. As the response propagates through each intermediate node, each node also sunlights its shadow extents.

Server-Side Recovery Routines

int dds_transfer_permissions(CHANNEL *cp, int to_mx, int from_mx)

This routine RECALLS or INVALIDATES the file image at the upstream node ‘from_mx’ and simultaneously transfers whatever permissions node ‘from_mx’ had to node ‘to_mx’. Node ‘from_mx’ may respond to the RECALL/INVALIDATE message or not. This server-side node does not really care. It has marked node ‘from_mx’ OFFLINE (for this file) and will handle all reintegration issues later when node ‘from_mx’ attempts to reconnect this file.

int dds_recall(CHANNEL *cp, int mx)

This routine RECALLS or INVALIDATES the file image at an upstream node.

IV. DDS Failure Recovery Procedure

When a network component fails, client-side nodes detect the failure and drive the transparent recovery procedure, during which the DDS filesystem client remains completely unaware of the failure.

The DDS failure recovery procedure operates in the following manner:

1. A client-side node dispatches a request downstream and is waiting for a response. If a response is not received within a timeout period, control will be returned to the thread that issued the request with an indication that the downstream site failed to respond.
2. The request thread calls dds_monitor_ds_paths(cp) to determine whether the path to the downstream site is operational. This routine, constantly fast pinging the downstream site, is the authority with respect to determining link status.
3. If the link is still good, the thread will just re-issue the request. This request will have the same message identifier (xid) as the original. If the downstream site did respond to the previous request, the response to this request is guaranteed to be the same response (thanks to the duplicate request cache).
4. If the link is down, the request thread references the path_state mini-database and selects the ‘best’ alternate path and then calls dds_reconnect(cp, mx) to re-establish a connection to the DDS server-side network. (Or alternatively, this thread may spawn a bunch of threads that would simultaneously attempt to re-establish connections to the DDS server-side network. Then, when several of the attempts are successful, one of the paths would be selected and the others could be disconnected or just allowed to atrophy.)
5. If the reconnect attempt is not successful, the request thread responds to the request (that it had originally received from an upstream client) with an error status of DDS_BAD_LINK. The upstream site will treat this error status the same as a TIMEOUT. So, the request thread at the upstream site will start executing step 2 above.
6. If the reconnect attempt is successful, the request thread immediately flushes all dirty data downstream. This flush operates as previously described: it is atomic and synchronous all the way to the server terminator site.
7. Once all file modifications have been secured, DDS operations switch into “normal” mode. The filesystem client is still accessing the file, but the path has been changed. The client remains unaware of the change.
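The branching in steps 3 through 6 of the procedure above can be condensed into a small decision function. This is only a sketch: the enumeration and the function dds_after_timeout are hypothetical, and the two boolean inputs stand in for the results that dds_monitor_ds_paths( ) and dds_reconnect( ) would supply.

```c
#include <stdbool.h>

/* Hypothetical names for the three outcomes that follow a request timeout. */
typedef enum {
    ACT_REISSUE_SAME_XID,   /* step 3: link is good, resend with the same xid */
    ACT_FLUSH_DIRTY_DATA,   /* step 6: reconnected, flush all dirty data */
    ACT_REPORT_BAD_LINK     /* step 5: answer upstream with DDS_BAD_LINK */
} dds_recovery_action;

/* Decide what the request thread does after a downstream timeout.
 * reconnect_ok is only meaningful when link_up is false. */
dds_recovery_action dds_after_timeout(bool link_up, bool reconnect_ok)
{
    if (link_up)
        return ACT_REISSUE_SAME_XID;
    if (reconnect_ok)
        return ACT_FLUSH_DIRTY_DATA;
    return ACT_REPORT_BAD_LINK;
}
```

Because a DDS_BAD_LINK status is treated upstream like a TIMEOUT, the same decision is then repeated one level further from the failure.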

Reintegration of Isolated Nodes

When an isolated network segment “comes back online”, isolated nodesautomatically re-synchronize their images of all files “crossing” thelink that had failed.

Channels containing modified data have the highest priority. Each isflushed downstream. The server-side will accept or reject each flush onan individual basis. A flush will only be rejected if the file wasmodified somewhere else while this client-side node was partitioned. Theclient-side node must have a means of handling a flush rejection. Thisprobably includes a) saving the dirty data that was just rejected, andb) notifying the user that there is a conflict that must be resolved.

The reintegration process is essentially complete after all dirty fileshave been flushed. Files that were being read before the partition eventoccurred can now be read again. No special processing is required.Cached file images will be updated on a demand basis if the source filehas been modified since the image was fetched.

The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

What is claimed is:
1. In a computing system where a data request has been passed between a file service proxy cache node and a downstream site, the file service proxy cache node being a network node located between a client system and the origin file system node, a non-transitory computer-readable storage medium including instructions that, when executed by the file service proxy cache node, performs the steps: dispatching a file access request to the downstream site; receiving a response to the file access request, wherein the response includes: a version number of a file image cached at the downstream site; comparing the version number of the file image cached at the downstream site to a version number of a file image cached at the file service proxy cache node; if the version numbers are the same: continuing to use the file image cached at the file service proxy cache node; and if the version numbers are different: setting the version number of the file image cached at the file service proxy cache node to the version number of the file image cached at the downstream site; setting an arrival time to the file service proxy cache node's current time; and resetting a delta time to zero.
2. The system of claim 1, wherein the file service proxy cache node includes an origin file server.
3. The system of claim 1, wherein the version number includes a last modification timestamp.
4. The system of claim 3, wherein comparing the version number of the file image cached at the downstream site to the version number of the file image cached at the file service proxy cache node includes comparing the last modification timestamp of the file image cached at the downstream site to the last modification timestamp of the file image cached at the file service proxy cache node.
5. The system of claim 1 further comprising: if the connection to the downstream site is lost: reconnecting to the downstream site over the same path.
6. The system of claim 1 further comprising: if the connection to the downstream site is lost: reconnecting to the downstream site over a different path.
7. The system of claim 1 further comprising: if the connection to the downstream site is lost: flushing all file images that had been fetched through the failed path.
8. The system of claim 7 further comprising: including with the flush request the version number of the file image cached at the file service proxy cache node.
9. In a computing system where a data request has been passed between a file service proxy cache node and a downstream site, the file service proxy cache node being a network node located between a client system and the origin file system node, a non-transitory computer-readable storage medium including instructions that, when executed by the file service proxy cache node, performs the steps: dispatching a file access request to the downstream site; receiving a response to the file access request, wherein the response includes: a version number of a file image cached at the downstream site; comparing the version number of the file image cached at the downstream site to a version number of a file image cached at the file service proxy cache node; if the version numbers are the same: continuing to use the file image cached at the file service proxy cache node; if the version numbers are different: setting the version number of the file image cached at the file service proxy cache node to the version number of the file image cached at the downstream site; setting an arrival time to the file service proxy cache node's current time; and resetting a delta time to zero; and if the response to the file access request is not a flush response: discarding the current cached file image.
10. The system of claim 9 further comprising: determining if the file access request was processed by the downstream site without any errors.
11. The system of claim 10 further comprising: if the file access request was not processed by the downstream site without any errors, repeating the request to the downstream site.
12. The system of claim 10 further comprising: if the file access request was not processed by the downstream site without any errors, dispatching a file access request to a second downstream site.
13. In a computing system where a flush request has been received at a file service proxy cache node from an upstream file service proxy cache node, the file service proxy cache node being a network node located between a client system and the origin file system node, a non-transitory computer-readable storage medium including instructions that, when executed by the file service proxy cache node, performs the steps: receiving a flush request from an upstream file service proxy cache node; comparing the version number of the file image cached at the upstream site to the version number of the file image cached at the file service proxy cache node; and if the version numbers do not differ: storing the received flush data in shadow extents; determining if the file service proxy cache node is a server terminator site; if the file service proxy cache node is a server terminator site: if all flush batches have been received and at least one of the flush batches has a data complete request flag set: determining whether synchronous write mode is being used for the file identified in the request; and if synchronous write mode is being used: writing all file modifications to the underlying origin filesystem; determining whether the write to the origin filesystem is successful; and if the write to the origin filesystem is successful: setting an all data received response flag and replacing all extents that have shadow extents with their respective shadow extents; fetching the file's last modification timestamp from the origin filesystem; and responding to the received flush request with a status code that indicates the successful completion of the request; and if all flush batches have not been received: responding to the received flush request with a status code that indicates the successful completion of the request.
14. The system of claim 13 further comprising: if the file service proxy cache node is not a server terminator site: the file service proxy cache node flushing all file modifications downstream.
15. The system of claim 13 further comprising: if synchronous write mode is not being used: responding to the received flush request with a status code that indicates the successful completion of the request.
16. The system of claim 13 further comprising: if the write to the origin filesystem is not successful: responding to the upstream node with a status that indicates the type of error that occurred.